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t— I ■ Abstract 

A qualitative comparison of total variation like penalties (total variation, Huber variant 
of total variation, total generalized variation, . . . ) is made in the context of global seismic 
tomography. Both penalized and constrained formulations of seismic recovery problems are 
•^T treated. A number of simple iterative recovery algorithms applicable to these problems are 

described. The convergence speed of these algorithms is compared numerically in this setting. 
For the constrained formulation a new algorithm is proposed and its convergence is proven. 
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CU 1 Introduction 

The aim of this paper is two-fold. To give a qualitative description of various total variation-like 
regularization methods in the context of a global seismic inversion problem, and to compare 
several iterative algorithms numerically, that can be used to perform these inversions. A new 

' 55 ' algorithm is also included and a proof of convergence is given. 

Inverse problems in seismic tomography are characterized by a combination of insufficient 
and noisy data. It follows that resulting tomographic reconstructions of wave-speed anomalies 
are non-unique and corrupted by noise; some kind of regularization of the inverse problem 
is required. The dynamics of the Earth's mantle leads to smooth variations (as a result of 
heat diffusion) as well as steep gradients (related to the exponential dependence of viscosity on 
temperature) , and sharp transitions (resulting from chemical and mineralogical variations caused 

_i_ ■ by lithosphere subduction and from phase transitions). An important challenge is to retain as 

much of this in the reconstruction despite the need to regularize. The attraction of simple 

(T) | smoothing algorithms giving simple images, is off-set by smoothing away sharp boundaries. 

Here we focus on regularization techniques that have the ability to reconstru ct sharp edges. 
Included in our discussion are the total variation prior (TV) of iRudin et al.1 1992 ] and a number 
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of generalizations and variations on this theme. Among them are the so-called Hu ber variant of 



the to tal variation prior (HTV), the total generalized variation method (TGV) of lBredies et al. 



20101 ]. etc. We also include a regularization method that uses a sparse expansion in terms of 



wavelet coefficients. The methods are described in explicit detail and some typical features of 
resulting reconstructions are discussed qualitatively in Section [2J 

Although details differ, the common theme of these penalties is that they all impose sparsity 
of certain local differences of the reconstructed model, which is achieved in practice by using a 
non-smooth convex £i-norm penalty of these local differences. The effect of using an £i-norm 
instead of an ^2-norm squared is that in practice small coefficients are penalized disproportion- 
ately more than large coefficients, which leads to sparse reconstructions (i.e. reconstructions 
with few nonzero coefficients). 



The use of an £i-norm penalty implies that non-linear equations have to be solved in the 
inversion. We therefore devote a section of the paper to the numerical comparison of the con- 
vergence speed of some iterative reconstruction algorithms for these problems. Indeed, a second 
common feature of these regularization methods is that they can be solved with very similar 
algorithms (with minor changes). 

In this paper we will assume a linear relationship between an unknown model u and data 
y characterized by a matrix K. One may think e.g. of the matrix K as containing in its rows 
the discretized versions (on some grid) of ray or finite frequency sensitivity kernels. The central 
mathematical theme of this paper is then the numerical minimization of some penalized least 
squares functionals of type: 

min — \\Ku — y\\ + penalty. (1) 

u 2 

The first term in this functional represents a quadratic data misfit term that depends on the 
data y and the matrix K: \\Kx — y\\ 2 = ^2i(Kx — y) 2 . The second term depends on u and 
serves to regularize the inversion, i.e. it serves to produce a unique model u that satisfies some 
(qualitative) assumptions imposed on the reconstruction. Here we are principally interested in 
penalizing local differences Au of the model u in such a way that edges in the model aren't 
blurred too much. To this end, we will study the problem: 

u(A) = argmin -||Kii — y\\ 2 + A||Au||i. (2) 

u 2 

that contains a non-smooth convex £i-norm penalty of local differences Au of u. The precise 
choice of the differencing matrix A and the precise form of the £i-norm || ■ ||i depend on the 
penalty that is preferred and on the particularities of the model (2D, 3D etc,...). Fully worked 
out examples of this kind of penalty, which include the TV penalty and generalizations, are 
presented in Section [21 with further details for solving the associated problem ([5]) in Sections 
andU 

Alternatively, instead of penalizing a least squares term, one may try to solve the constrained 
problem: 

u(e) = arg min ||Au||i. (3) 

\\Ku—y\\<e 

The problems ([2]) and (J3j) are equ ivalent in the sense that for corresponding A and e, the models 



-u(A) and u(e) are equal; see e.g. [Hennenfent et al.l . 120081 . Ivan den Berg and Friedlanderl . 120111 ] 



for the case A = 1. The formulation ([3j) is useful as the parameter e (related in practice to the 
amount of noise on the data y) is perhaps easier to estimate than the penalty parameter A in 

©■ 

In Section [3] we write explicit formulas for several iterative algorithms for the penalized 

problem ([2]) and we compare the speed of convergence (in the special case of the TV penalty). 

The comparison is done for a matrix K that has no special structure (not the identity matrix, 

not convolution, ...). We also discuss two iterative algorithms for the constrained problem ([3]). 

One we believe is new, and we prove convergence. We also make a numerical comparison of the 

speed of convergence in the constrained case. 

Except for one, the algorithms discussed in this paper are fully explicit. They require only 
application of matrix vector products (with matrices A and K), and one or two simple convex 
projections. It is important to remark that only the precise form of these projections and 
the choice of the local differencing operator A differ between the various penalization methods 
discussed in this paper. Again fully worked out examples are given for the penalties discussed 
in Section [2j 

In section [2] we will review a number of non smooth regularization terms in the context 
of a toy synthetic seismic tomography experiment. We discuss the effect the various penalty 



choices have on the reconstruction. Section [3] lists a number of iterative algorithms that can be 
used to perform these inversions. A numerical convergence speed comparison is made, both in 
the penalized case ([2]) as well as in the constrained case (J3|). Section H] explains in detail how 
these algorithms can be used to solve the problem introduced in Section [21 i.e. we give explicit 
expressions for the convex projection operators used by the various penalties. Finally, Section [5] 
contains a proof of convergence of a new iterative minimization algorithm for the constrained 
problem ([3]). 

2 Comparison of edge-preserving regularization methods 

Natural images are often characterized by the presence of both sharp edges and smooth transi- 
tions. It is therefore not surprising that much research is done for finding methods of denoising 
or deconvolving ima ges that can preser ve sharp edges. A popular technique is the use of the 
total variation prior Rudin et al.l . ll992j ]. and many generalizations have been proposed. In this 



section we review a number of these TV-like techniques, compare synthetic 2D reconstructions of 
a toy model in global seismic tomography and describe their principal features. The description 
of computational algorithms for these tasks is deferred to Section [3] and specific implementation 
details for each of the cases discussed are given in Section [U 

2.1 Total variation penalty and generalizations 

The total variation (TV) penalty is defined by the £i-norm of the gradient of the model u, i.e. 
we choose A = grad so that the penalty in expressions ([T]) and ([5]) becomes: 



penalty = A||Au||i = A||grad(«)||i = A ^ ^ {A x u) 2 + (A y u) 2 . (4) 

pixels 

Here grad(-u) = (A x u, A y u) and A x u and A y u are first order local differences of the 2D model 
u. In formula §£§ the sum ranges over all pixels of the model u. This type of regularization will 
force the gradient of the model to be (exactly) zero in many places: TV promotes sparsity of the 
gradient of u. In other words, it will give rise to a piece-wise constant reconstruction whenever 
the data allows for it. This property is lost when the square root in expression (JU) is removed. 
This penalty is translation invariant and isotropic (invariant under rotations). Sometimes a 
non-isotropic TV is used which is defined as a sum over all pixels of |A x w| + |Ayu|. Formula ([3D 
is easily adapted to 3D models too. 

A slight generalization of the TV penalty is the Huber total variation penalty (HTV). Here 
we replace the y/(A x u) 2 + (A y u) 2 terms of (jl|) by: 

penalty = A £ h L] {A x u) 2 + (A y u)A (5) 

pixels 

where the function h is defined by 

hM J l*| 2 /2« if \t\ < a 

h ^ = \\t\-a/2 if \t\>a (6) 



(see [Huber! . 11964 Section 4, point (iii)]). Clearly HTV reduces to TV for a = 0. On the other 
hand, when a is positive, large gradients are penalized as in TV (up to an irrelevant constant), 
but small gradients are penalized quadratically so that the sparse gradient promoting property 
of the TV penalty is lost. In practice, local differences will be kept small by this penalty (if 
possible), but non will be exactly zero as in TV. 



A third kind of regularization met hod that we will co nsider is a (non-symmetric) total 
generalized variation penalty (TGV) of Bredies et al.l . |2010| . remark 3.10]. It also involves 
auxiliary variable v = (v x ,v y ) and the problem ([2]) takes the form 



an 



min —\\Ku — y\ 

21 ! -y \ 



+A Y^ \/( A x u ~ ^) 2 + ( A v u ~ v v) 2 + a^(A x v x ) 2 + (A y v x ) 2 + (A x v y ) 2 + {A y v y f 

pixels 



(7) 



(see Section H] for more details). This generalization was designed to yield piece- wise smooth 
models (instead of piece- wise constant models as in standard TV). The penalty depends on an 
additional parameter a that controls the balance of the first and second term. Many other 
generalizations of the T V penalty exist; see e.g. Chambolle and Liond . Il997l . iBredies et all 



201(1 IChan et alll200d |. 



A straightforward variation on the TV penalty is the use of the ^i-norm, not of the gradient 
of the model, but of second order differences of the model (i.e. of the local Hessian matrix; 
A = Hess). We will call it the Hessian penalty (HP): 



penalty 



AHA* 

pixels 



A||Hess(u) 



(A x u) 2 + (A x A y u) 2 + (A y A x u) 2 + (Aluf 



(8) 



(in each pixel Hess(-u) is a 2 x 2 matrix; see again Section H] for details). The use of penalty 
([8]) will force the local Hessian of the model to be zero in most places, i.e. the model will be 
piecewise linear. Here, any matrix norm on the local Hessian can be chosen; those matrix norms 
that are expressed in terms of the eigenvalues will yield a penalty that is isotropic. We choose 
the Frobenius norm as it is easy to work with and isotropic (an explicit expression is given in 
Section |U). 

Regularization strat egies that try to express the model as a sparse linear combination of 
wavelet basis functions [Mallatl . 120091 ] also fit in the category of penalties that use the £i-norm 
of local differences. In that case one has 



penalty = A||Au||i = \\\Wu\ 



(9) 



were W represents a wavelet transform. A wavelet transform W interleaves local differencing 
(and averaging) with subsampling. Such a penalty depends directly o n the choice of wavelet 
basis (choice of W). On a cartesian grid, many wavelet families exist Mallatl . 120091 ] and they 
are relatively easy to implement (although not as easy as the local differencing used in TV 
and its variations). Examples include non-smooth wavelet functions such as the simple Haar 
wavelets, smooth orthogonal wavelets or smooth symmetric wavelets. However, wavelets are 
non-stationary (preferred positions exist) and 2D and 3D wavelets are non-isotropic (preferred 
directions exist). Partial solutions to these two problems include the use of the un decimated W 



or of directional transforms such as curvelets Candes et al.l. I2006H o r shea r lets Labate et al. 
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Such a strategy has been used in geosciences in 



Lor is et al. 



20081 . iHerrmann and Hennenfentl . [2008J, iGholami and Siahkoohil . [201CJ, ISimons et all 120111 ] etc. 



2007, 



Hennenfent et al. 



2.2 Qualitative comparison on a synthetic example 

In the remaining part of this section we apply the various regularization methods described above 
to a toy problem in seismic tomography. We consider a simple synthetic 2D input model u mput 
defined on the globe (see Figure [TJ panels a-e). The input model is chosen to have a number 
of zones of constant value (+1 in blue color or —1 in red color) with either a sharp edge or a 



smooth transition in between. Sharp edges are found near North-America. Smooth transitions 
are found around the Indian Ocean. The edges and transitions are circle-shaped so as not to 
give preference to any particular direction (i.e. edges are not aligned with the parametrization 
grid of the sphere). The center of this 'bull's eye' pattern is located at (36°N, — 120°E). The 
model has circular symmetry around this point. 

In panel (b) of Figure [H a cross section of this model is shown along a great circle passing 
through the point (36°N, — 120°E) and through the North Pole. The second row of Figure [1] 
(panels c and d) depict the length of the local gradient of the input model: |grad(it mp )|. The 
gradient is sparse; n onzero values can o nly be found near edges and transitions in the model. 
The 'cubed sphere' Ronchi et all 1 19961 ] was used as a parametrization of the sphere. In this 



toy experiment the model space has dimension 98304 (= 128 x 128 x 6 pixels 

In order to set up a synthetic inverse problem, we us e a set of 8490 seismic rays correspond- 
ing t o actual earthquake positions and seismic stations Trampert and Woodhousd . Il995l . Il996 



20011 ] . The rays (discretized on the 128 x 128 x 6 grid) make up the rows of a 8490 x 98304 
matrix K. In panel (f) of Figure [1] the sum of all rows of the matrix K is shown. It represents 
the illumination of the globe by the 8490 ray paths. Most rays are concentrated around the 
Pacific Ocean. The matrix K does not have any special structure that can be exploited by a 
minimization algorithm. 

Next, artificial data y are constructed using the formula y = Ku mpnt + n, where n is chosen 
as gaussian noise of magnitude ||n|| = 0.1 x ||i , fu mput | (in other words 10% gaussian noise is 
added). Our goal is to use the different regularization techniques described in Section [27T1 to 
obtain faithful reconstructions of u mput (from the knowledge of y and K only) , and to compare 
some of their principal characteristics. 

We do not expect perfect reconstruction ( w out P ut ^ ^ m P ut ) because the problem is too under- 
determined (only 8490 data for 98304 unknowns), and because the data contain noise. The use of 
TV style regularization methods is appropriate as the model has several areas of constant model 
value, together with some edges. However, we expect that the TV penalty will unfortunately 
also enforce piecewise constant model values near the smooth transitions. We compare with the 
other regularization methods. 

Four different reconstructions are made, corresponding to TV, HTV, TGV and HP penalties 
as described by formulas ([!]), ©, © and ^ in combination with functional t^fy. The precise 
iterative algorithm that is used to solve these four instances of problem ([2]) is described in 
Section El formula (fTT]) . A detailed description is deferred to Section HI At the moment we 
limit ourselves to discussing some qualitative differences of the resulting output models. It 
is important to mention right away that all reconstructed models fit the data equally well: 
||^ n output _ y|| — || n ||_ xhe four reconstructions are displayed in Figure [2 

In Figure [21 a map view of the four reconstructed models is displayed on the left hand 
side and a cross section of the four output models is shown in the right hand side column (for 
reference also a cross section of the input model is shown in dotted lines). As before the cross 
sections follow the great circle that passes through the point (36°N, — 120°E) and through the 
North pole. Note however that the output models have lost (some of) their circular symmetry, 
and a different cross section (along another great circle through the point (36°N, — 120°E)) will 
yield a slightly different profile. 

The piece-wise constant nature of the TV output model is clearly noticeable in panels (a) 
and (b). There is also some loss of amplitude in certain regions. Distinctively, sharp edges (e.g. 
near North America) are preserved but smooth transitions of the input model (e.g. near Africa 
and the Indian Ocean) are replaced by a succession of sharp edges (the so-called staircasing 
effect). In other words a TV penalty imposes a piecewise constant reconstruction as much as 
the data allows for it. Such a reconstruction always has sharp edges, but it is not guaranteed 
that the reconstructed edge will be at exactly the same location as in the input model (the 
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(b) Input model: u input (cross section) 
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Figure 1: Toy input model for synthetic seismic tomography experiment (see Section [2]): (a) 
input model u mput with both sharp and smooth edges between zones of constant model value; 
(b) cross section along a great circle passing through (36°N, — 120°E) (dashed line in panel (a)); 
the horizontal axis measures the degrees of separation from this point in Northern direction; (c) 
length of the local gradient of input model (mostly zero except at edges and transitions); (d) 
histogram of the length of the gradient (on a logarithmic scale; the peak at —16 corresponds 
zero gradients); (e) Histogram of model values; (f) Illumination of the globe by the 8490 ray 
paths in the matrix K. 



(a) TV output model (map view 



(b) TV output model (cross section) 
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Figure 2: Reconstructions of the input model for four different regularization methods. The cross 
sections on the right show the output models (solid lines) and the input model (dashed lines). 
(a)-(b) the total variation reconstruction has an obvious piecewise constant character. The 
amplitude of the model appears damped in some regions, (c)-(d) the Huber-TV reconstruction 
is similar to the TV reconstructions but the staircasing effect is less pronounced, (e)-(f) the 
TGV reconstruction, (g)-(h) the Hessian style reconstruction is piecewise linear. 



(a) Haar wavelets output model (map view) 




(b) Haar wavelets output model (cross section) 
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Figure 3: Reconstructions of the input model for two different wavelet regularization methods. 
The cross sections on the right show the output models (solid lines) and the input model (dashed 
lines), (a)-(b) a reconstruction that is sparse in the Haar wavelet basis, (c)-(d) a reconstruction 
that is sparse in the CDF4-2 wavelet basis. The Haar model uses only 1668 nonzero wavelet basis 
coefficients, and the CDF 4-2 reconstruction uses a mere 996 nonzero wavelet basis coefficients. 



position is influenced by noise on the data and by the limited number of rays). 

The Huber TV reconstruction looks similar to the TV reconstruction but it has edges that 
are somewhat smoother. The piecewise constant nature of the reconstruction is lost. It should 
be said that the precise amount of smoothing depends on the value of the extra parameter a 
that is present in the Huber penalty. For a = one recovers the TV reconstruction, but for 
large a one obtains a reconstruction that is penalized by the ^-norm squared of the gradient. 

The TGV reconstruction exhibits edges that are comparable to the Huber TV reconstruc- 
tions, but amplitudes are damped less. Some edges are replaced by linear transitions. The TGV 
penalty also depends on a parameter a that can be tuned between the TV case (large a) and 
the HP case (small a). Linear transitions also characterize the HP reconstruction. Here the 
penalty is proportional to the 1-norm of the second derivative of the model. For the example 
shown, the difference between the TGV reconstruction lies somewhere between the TV and HP 
reconstruction. 

Finally, Figure [3] shows two reconstructions that were obtained by imposing a £i-norm penalty 
of wavelet coefficients of the model. Panels (a) and (b) of this Figure show a reconstruction that 
is sparse in the Haar wavelet basis. The number of nonzero basis coefficients is 1668 (out of a 
possible 98304). These wavelets are orthonormal, but non-smooth. The sec ond example ( panel s 



possible yodU4j. inese wavelets are ortnonormal, but non-smootn. ine sec ond example ( panel s 
c and d) uses a wavelet family with very smooth wavelets (CDF 4-2) of I Cohen et al.1 [19921 ] . 



This model has only 996 nonzero coefficients (out of a possible 98304). The smoothness of the 
wavelet bases functions is clearly reflected in the nature of the reconstructions shown in Figure 
These two reconstruction fit the 8490 data as well as the four previous reconstructions. The 
Haar wavelet reconstruction is not visually appealing. Even though the Haar basis functions 
are piece-wise constant, the Haar reconstruction does not look similar to the TV reconstruction. 
If wavelets are used, one not only needs to choose the wavelet family (Haar, . . . ), but also the 
number of levels of the wavelet transform. In this example, we chose to make a 4 level wavelet 
transform. 

The original input model possesses an azimuthal symmetry around the point (36°N, — 120°E). 
However, the reconstructed models have lost this symmetry. This is due to the lack of data, 
the uneven distribution of the ray coverage, the noise on the data and the penalties imposed. 
For instance, the (circular) edges that are present in the input model are reconstructed, but the 
precise position is changed depending on this azimuthal angle. Instead of showing a single cross 
section of the output models, as was done in the second column of Figures [2] and [3j it makes 
sense to also calculate the cross sections along many great circles passing through the same point 
(36°N, — 120°E) and average them. The average over 36 great circles is displayed in FigureHl in 
the case of the TV reconstruction one sees that such an average cross section does not exhibit 
the same sharp edges as a single cross section does. The cause is simply that the output model 
is not circularly symmetric around this point. In other words the reconstructed edges do not 
necessarily coincide with the original edges. Averaging this effect leads to some smoothing. This 
also occurs in the Haar reconstruction. The effect is less evident on the other reconstruction 
that are already smooth. Figure [5] shows a map view of the difference |^ out P ut — -u m P ut | between 
input model and reconstruction. 

Figure shows the length of the local gradient of four reconstructed models in map view 
and as a histogram. These should be compared to Figure [IJ panels (c) and (d). The gradient 
of the TV reconstructed model in panels (a) and (b) is sparse, although |grad(u out )| is nowhere 
exactly equal to zero. This is a result of stopping the iterative reconstruction algorithm after a 
finite number of iterations (in this case after 1000 iterations). This behavior is best seen on the 
histogram where the peak between 10 -4 and 10 -6 represents values of |grad(n out )| that have not 
fully converged to zero. A black line shows the position of the histogram after 10 5 iterations. 
We can clearly see that the secondary peak has moved to the left, i.e. to values around 10 -8 . 
The primary peak stays roughly in the same position, indicating little change to the significant 



(a) TV output model (average cross section) 



(b) Huber TV output model (average cross section) 
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Figure 4: Averaged cross sections of the six reconstructions. The average is taken over 36 great 
circles passing through the point (36°N, — 120°E), and with azimuth 0,10,20,... ,350 degrees 
w.r.t local North. 



10 



(a) TV: luoutput.yinput 



(b)HuberTV: ^output _ ^inputj 




Figure 5: Map view of the differences |<u out P ut — <u m P ut | for the six reconstructions. Large errors 
may occur near edges (as a result of smoothing or change of edge position in <u out P ut ) an d in 
other places (as a result of amplitude damping in u° pu ). 
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(a) TV: |grad(u out P ut )| (map view 



(b) TV: log 10 |grad(u out P ut )| (histogram) 
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(c) Huber TV: |grad(u out P ut )| (map view) (d) Huber TV: log 10 |grad(u out P ut )| (histogram) 
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(e) TGV: |grad(u out P ut )| (map view) (f) TGV: log 10 |grad(u out P ut )| (histogram) 
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(g) HP: |grad(u out P ut )| (map view) (h) HP: log 10 |grad(u out P ut )| (histogram) 
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Figure 6: Gradient field of four reconstructed models of Section [21 On the left hand side is a 
map view of the norm of the local gradient of the output models. On the right hand side is a 
histogram (of the logarithm) of the norm of the local gradient of these output models (in blue, 
after 1000 iterations; a black line indicates the position of the histogram after 10 5 iterations). 
For the first reconstruction (TV), one expects this gradient field to be zero in most places. The 
Huber-TV reconstruction penalizes the local gradient but small (nonzero) values remain. For the 
last two reconstructions one does not expect sparse (or almost sparse) gradients, but a piecewise 
constant gradient. 



values of the gradient after 1000 iterations. 

On the second row of Figure [6] we see that the Huber TV penalty does not lead to sparse 
gradients (for a^O). Although many values are small (colored in white on the map view), 
there is no heavy tail visible on the left hand side of the corresponding histogram in panel (d). 

The TGV and HP penalties (panels (e)-(h)) give rise to local gradients that are markedly 
different in character from the TV and Huber TV cases. Here, as in the Huber TV case, we 
do not expect sparse gradients. The HP reconstruction imposes a piece-wise linear solution, 
as much as the data allows for. Therefore the gradient of the output model will be piecewise 
constant. This is clearly observed in panel (e) of Figure El In case of the TGV reconstruction, 
the results lie somewhere between the TV reconstruction and the HP reconstruction. 

In case of the HP reconstruction, we expect a model that has sparse Hessian field (if one 
would plot the Frobenius norm of the local Hessian (see formulas (JSj) , (j28|) and (j30|) for explicit 
expressions) the sparse nature of this Hessian field would be apparent). In other words, the 
second derivatives of the model will be mostly zero. This means that the output model will be 
piecewise linear. 

The six imaging models (H])--(j9|) give rise to qualitatively different reconstructions. The as- 
sociated minimization problems ([2]) or ([3|) however, can all be solved using the same iterative 
algorithms, with minor variations. The next section contains a description and numerical com- 
parison of several suitable algorithms. Section [5] contains the technical details on how these 
algorithms can be applied to the penalized problems discussed above. 

3 Algorithms 

In this section we review a number of iterative algorithms that can be used for solving problems 
([2]) and ([3]) and make a numerical comparison of them. We start with some known algorithms 
for the penalized problem ([2]) in Subsection 13.11 

In Subsection 13.21 we perform the same kind of comparison for two iterative al gorithms for 



the co nstrained problem Q . One is a primal dual hybrid gradient algorithm of lEsser et al 



20101 ] and the second one is, as far as the authors know, new. A proof of convergence of the 



second algorithm is therefore included in Section \5\ 

The algorithms presented generally involve matrix-vector multiplications and vector space 
operations (addition, multiplication with scalar). Furthermore they also involve a simple convex 
projection operator P\ or a simple soft-thresholding operation S\. These are defined component- 
wise by P\((wi, .. . ,w N )) = (P\(wi),...,P\(w N )) a,ndS\((w±,... ,w N )) = (S\(wi),... ,S\(w N )) 
with: 

,A \\wi\\ > A 



P\(wi) = < M ,' and S x (w l ) = w i -P x (w l ). (10) 

I m \\Wi\\ < A 

Here Wi may itself be an element of R, M. 2 , .... We refer to Section H] for some more information. 
The precise details may vary depending on wether one treats a 2D or 3D problem, or the precise 
form of the penalty (TV, Huber TV etc). 

Below we will use the symbol ||^4|| 2 to denote the largest eigenvalue of A T A, and \\K\\ 2 to 
denote the largest eigenvalue of K K. Apart from the model variable it, the iterative algorithms 
below also use one or several auxiliary variables u, w (which is the subgradient of A|| • ||i), etc. 
The starting point of each of the algorithms in Subsections 13.11 and 13.21 is arbitrary. 
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3.1 Algorithms for penalized problems 



The generalized iterative soft-thresholding algorithm Loris and Verhoevenl . 120111 ] : 



u 



n+l 



+ T X K T {y - Ku n ) - Tl A T 



w 



10 



n+l 



II 



n+l 



P x (w n + %Av 



n+l 



(11) 



u n + T 1 K T (y-Ku n )-T 1 A T w n+1 , 

converges to a minimizer of the penalized problem ([2]) if step sizes n and t<i are chosen as 

iri||i^|| 2 < 1 andr 2 P|| 2 < 1. 



When applied to problem ([2} the algorithm of IChambolle and Pockl 20111 ] takes the form: 



w n+1 

n+l 



V 



U 



n+l 
-,n+l 



Px 

1 
1+ff 

u 
u n+1 + . 



w n + a^Au r 



v n + j^(Ku n -y) 



1+0 



II 



TlK T v n+l _ nA T w n+l 



(12) 



.n+l 







It converges to a minimizer of problem ^) for 9 = 1 and a\\TiK K + T2j4 r A|| < 1. 
An explicit Bregman algorithm: 



,n+l 



U 



+ nK T (y - Ku n ) - nA T [w n + \- 



w 



w 



w 



n+l 



(1 - 6)w n + 6P X [w n + ZLAu n+1 



- 1 )] 



(13) 



conve rges for < 6 < 1 and UriX^if + T2^4 T A|| < 1 to a minimizer of problem (|2|). It is found 
from Zhang et all l201ll . Equations 5.9, 5.10 and 5.11]. 

The conditions for convergence on K and A for algorithms (|12p and (|13p are different from the 
condition in algorithm (|lip . In the last two algorithms K and A are coupled in a single condition, 
whereas two separate conditions are used for algorithm (jlip . In other words, the conditions in 
algorithm (|11|) only depend on the matrix norms of K and A, whereas in algorithms (|12|) and 
(|13p the conditions also depend on the orientation of the singular vectors of K with respect to 



the singular vectors of A. This makes algorithm (jlip a little bit easier to use. Also, no k is 
found in front of K Kin the conditions for algorithms (|12p and (|13p . 

It is also worth noting that for a = 1 and 6 = 1 the algorithm (|12p reduces to algorithm (|13p 
for = 1. We omit the details of this calculation. 

The above three algorithms are fully explicit: they only require the application of the matrix 
K and its transpose at every step, and the application of the operator A and its transpose. 
The nonlinear operator P\ also has a simple implementation (see also Section U] for explicit 
expressions). 

In addition to the three explicit algorithms above, we also include the following implicit 
algorithm in our comparison (here implicit means that a linear system needs to be solved in 
every step): 

{K T K + aA T AY l (K T y - A T (w n - az n )) 

S x/a (Au n+1 + w n /a) (14) 



v n+l 



W 



n+l 



W 



+ a{Au r 



y n+l^ 



for < 9 < 1 and a parameter a. It is the so-called split-Bregman method [Goldstein and Qsher 
2009J . 



The split Bregman method was used in a seismic tomography context in Gholami and Siahkoohi 
20101 ]. The method seems appropriate for some special matrices K and A, for which the inverse 
(K K + a>A A)~ l is easy (such as e.g. the combination convolution K and A =grad which can 
both be diagonalised in Fourier space) or when legacy code exists: If K and A have no special 
structure, the first line in the split-Bregman algorithm (|14p itself needs an iterative algorithm. 
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The above four algorithms are compared in the framework of the TV reconstruction of 
Section [21 i.e. using the matrix K and the synthetic data y of Secion[2]and the choice A =grad. 
A reference minimizer u of problem ([2]) was first obtained using 10 5 iterations of algorithm 
(fTTl) with n = 1.99/ 1| if || 2 and r 2 = 0.99/ P|| 2 . This was done for two values of A. One value of 
A yields a reconstructed model that fits the data to noise level \\Ku T — y\\/\\n\\ ~ 1, and another 
(larger) value of A that underfits the data: ||.ffK ref — y||/||n|| ~ 3. In this way, it is possible to 
evaluate the performance of the algorithms for different values of A. A larger value of A would 
be used for cases with more noise; a smaller value of A would be used for cases with less noise. 

In Figure El panels (a) and (b), the evolution of the relative error \\u n — u mi ||/||u ref || to 
the true minimizer is plotted as a function of computing time for these four algorithms. The 
algorithms ran for 1000 iterations each (starting from zero model), except the split-Bregman 
algorithm which ran for only 200 outer iterations (and 5 inner iterations of conjugate gradient 
type). On the horizontal axis computing time is used rather than number of iterations as the 
split-Bregman algorithm's outer iteration step is more expensive than a single step of the explicit 
algorithms. Indeed, the explicit algorithms use a single application of the matrices K, K T , A 
and A in each iteration whereas the implicit split-Bregman algorithm needs to solve a linear 
system in each step. The times mentioned in Figure [7] were obtained for this specific example 
on a single 2.66GHz CPU with 12GB memory running Matlab R2009a. They cannot easily be 
extrapolated to other problems (with different numbers of variables, data, other matrices K 
and A, different hardware). Panels (c) and (d) show the evolution of the functional Q as a 
function of time. For algorithms (jlip and (|12p . one can show that the functional (El) tends to 



its limiting value as Q(l/n ) where n is the number of iterations Loris and Verhoevenl . 12011 



Chambolle and Pockl . 1201 II ]. There is no such result for the error \\u n — u\\. 

The algorithms that perform the best for the smaller value of A perform the worst for the 
larger value of A. Moreover we see that the smallest value of \\u n — u \\/\\u \\ does not 
necessarily correspond to the smallest value of F(u n ) — i ? (u ref ). We conclude that all algorithms 
mentioned are suitable for solving problem ([2]), but that convergence may depend on a good 
choice of the step size parameters used. For algorithm (fTTl) the larger step size n = 1.99/||K|| 2 
is suitable for cases with low noise, whereas the choice T\ = 0.99/||-fC|| 2 is better in high noise 
cases. Generally, the error between u n and the true minimizer lies between 1 and 10% after 1000 
iterations. 

It was already remarked that these algorithms do not give sparse Au n at every iteration 
step; only for the limiting value of u°° will Au°° be sparse. When A is the unit matrix or an 
orthogonal matrix, then the generalized iter ative soft threshdolding algorithm (flip reduces to 



the traditional soft-thresholding algorithm of lDaubechies et al.l 20041 ] which does produce sparse 
u n at every step: 

u n+1 = S Xri (u n + T 1 K T {y-Ku n )) for min -\\Ku - yf + A||d|i (15) 

' u 2 

for step sizes ti||ET|| 2 < 2. Su ch a simplified alg orithm could be used for sparse recovery in 



a wavelet basis as was done in Loris et al.1 . 120071 ] in a seismic tomography context. It suffices 



to make the change of variables u = W~ l w in expressions ([2]) and ([9]) to use algorithm (|15p 
with K replaced by KW _1 and u n replaced by w n . Such a change of variables is not possible 
for the TV penalty as A = grad is not invertible. An accelerated (more efficient) version of 
alg orithm (1151). the so - called Fast Iterative Soft-Thresh olding Algorithm (FISTA), is described 



m 



Beck and Teboulld . 120091 ] (for T\\\K || 2 < 1). See also lYamagishi and Yamadal 2011 ] for some 



recent developments. 

For K = 1 (or orthogonal) and t\ = 1 the algorithm (|lip reduces to the following projected 
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for large A 




10 15 

time (s 

(c) (F(u n ) - F(u ref ))/F(u ref ) for small A 
10° 



10 15 

time (s) 

(d) (F{u n ) - F(u ret ))/F(u ref ) for large A 
10 2 




10 15 

time (s 

GISTA, n = 0.99/\\K\\ 2 , r 2 = 0.99/\\A\\ 



10 15 

time (s) 



GISTA, n 



Expl. Bregman, n = 0.92/||K]| 2 , r 2 = 0.92/||/\|| 2 , 

1 .99/\\Kf, r 2 = 0.99/||/A|| 2 Split Bregman, 5 inner iterations, a = ||/<|| 2 /||/\|| 2 



= 1 



Chambolle-Pock, n = 9.9/||K"|| 2 , r 2 = 9.9/\\A\\ 2 , a = 0.094 

Figure 7: Convergence rate of the iterative minimization algorithms of Section [3TTJ with relative 
error to the true minimizer on the top and value of the functional on the bottom. The right 
hand side column refers to an experiment with a larger penalty parameter and left hand side 
column refers to an experiment with a smaller penalty parameter. The best algorithm depends 
on the value of the penalty parameter A, on the step sizes used, and on the criteria (distance to 
true minimizer or value of functional). Computing time instead of number of iterations is used 
for fair comparison. 
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gradient algorithm: 



w 



n+l 



II 



n+l 



K T y- 



+ T 2 A(K T y - A T w n )) 



A T 



w 



n+l 



for 



nun— it 

u 2 



y|| 2 + Apu||i 



(16) 



for st ep size T2||^4|| 2 < 2 [Chambolle! . 120051 ] . This algorithm can also be accelerated Nesterovl . 
1983]. The case K = 1 and A = grad corresponds to denoising with a total variation penalty. 
A recent comprehe nsive comparison of various numerical algorithms for this task can be found 
in Section 6.2.1 of Chambolle and PockU201ll ]. 



3.2 Algorithms for constrained problems 

In this subsection we compare two explicit iterative algorithms for the constrained minimization 
problem ([3]). This problem depends on the parameter e which determines the desired data misfit: 
ll-Kx — y|| < e. Such a formulation of a reconstruction problem is therefore useful when one wants 
to fit the data to noise level. It then suffices to set the parameter e equal to the norm of the 
noise vector (Morozov discrepancy principle). 

In addition to the projection operator P\ the constrained algorithms below also use another 
projection operator Q y ^ e defined as: 



QvA v ) 



y + e 



\v-v\\ 



for 
for 



y\\ > e 
y\\ < e 



and 



T vA v ) 



Qy,e(v) 



;i7) 



given y and e. In other words, Q y ^ e is the projection on the £2 ball of radius e centered at y and 
Ty i€ is an associated thresholding function. 

We start wi th the so-called p rimal-dual hybrid gradient algorithm (PDHGMp) found in 
equation 5.3 of [Esser et all I2010I ]. In the present notation this algorithm takes the form: 



u 



n+l 



W 



n+l 



U 
P 



T 1 K T (v n + (v r 



,n-l 



)/9) - n^ T (u; n + (w r 



w 



n-1 



)/0) 



M/ti 



,n+l 



w n + ^Au n+1 



(18) 



(1 - 6)v n + 0T y>e (v n + Ku n+1 ) 



and converges for < 9 < 1, \x > 0, \\ t-\K t K + T2 A t A\\ < 1. This algorithm can also be derived 
from algorithm (A-\) on pa ge 28 of Zhang et all 120111 ] and (for 6 = 1) from algorithm 1 in 



Chambolle and Pocku201ll ]. We omit the details. 



Another algorithm is given by the formulas: 



f ^n+l 

w n+l 

u n+l 
„,n+l 



U 



P 



nK T (v n + (v r 



w n + ^Au n+1 

n 



u n - t\k t \v n + (v 
(1-1 



„n— 1 



,n-l 



)/0) - T X A T W r 



)/6) - T!A T W n+ 



T„„n+1 



(19) 



v n + 9T y)e (v n + Ku n + 



) 



and it is proven in Section [5] that it converges to the minimizer of problem ([3]) for < 9 < 1, 
fi > 0, ti|| if j| 2 < 1 and T2||^4|| 2 < 1. We will refer to this algorithm as the generalized basis 
pursuit denoising algorithm (GBPDNA). The motivation for this name is given at the end of 
this Section. 

In Figure [S] the two algorithms are compared numerically on the same problem as in Subsec- 
tion [3?T] (i.e. same K, A and y). Moreover, we chose two values of e corresponding to the value 
of ||ifn rcf — y|| taken by the two reference minimizers of the simulation in the Subsection 13. 1[ In 
other words, we chose e so that the corresponding minimizers are identical to the minimizers of 
the numerical simulation in Subsection 13.11 Because of this choice of e it is possible to directly 
compare panels (a) and (b) of Figure [8] with panels (a) and (b) of Figure [71 
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(a) ||u n -u ref ||/||u ref || for small e 
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(b) ||u n -i/ ef ||/||[/ ef || for large e 
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GBPDNA, t-1 = 0.99/IIKH 2 , r 2 = 0.99/||-4|| 2 , (i=1x ||K' T y|| 00 , = 1 
GBPDNA, n = 0.99/||/C|| 2 , r 2 = 0.99/||4|| 2 , /u = 0.1 x ||K" T y||°o, 9 = 1 
PDHGMp, n = 0.92/||/<|| 2 , r 2 = 0.92/||^|| 2 , M = 1 x U^ylU, 6 = 1 
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Figure 8: Convergence rate of two iterative minimization algorithms for the constrained problem 
([3]). Pictured is the relative error to the true minimizer (obtained from algorithm f)19|) and 10 5 
iterations) . The figure on the left hand side is for a small value of e and the one on the right is 
for a larger value of e. Each algorithm was run for 1000 iterations. 



We recompute the true (reference) minimizer u with 10 5 iterations of algorithm (|19p with 
r\ = 0.99/||-ftT|| 2 , T2 = 0.99/||^4|| 2 and [i = ||i ; C T y|j 00 . We verified that the resulting minimizer is 
equal (up to 0.1%) to the reference minimizer of subsection 13. II Then we compare the iterates of 
algorithms (|18p and (|19p to these reference minimizers. We clearly notice that both algorithms 
are almost identical in their convergence behavior. The effect of an appropriate choice of the 
parameter [i in these algorithms is also visible in Figure [8) We conclude that there is no 
significant difference in convergence speed between these two algorithms (|18p and (|19p for the 
constrained minimization problem (J3]) . In practice it is slightly easier to work with two conditions 
of type Tilli^H 2 < 1 and T2||A|| 2 < 1 than one condition of type \t\K t K + T2^4 T j4|| < 1. 

The constrained problem (J3j) reduces to the so-called 'basis pursuit denoising' problem: 



arg mm ||w||i 

\\Ku—y\\<e 



(20) 



when A = 1, and to 'basis pursuit' [Chen et al.1 . 1199 



arg mm ||u||i 

Ku=y 



(21) 



when A = 1 and e = 0. In the case of problem (f2"Uj) the proposed algorithm (fT9"j) reduces to: 

n+l c* (n.n ^-. TfT ( „,n i ( n ,n „,n—l\ 



,n+l 



S»(u n - nK 1 (v n + (v n - v" 1 - 1 )^)) 



(1 - 9)v n + 0T y ^ (v n + Ku n+1 ) 
and in case of problem (f2Tp the proposed algorithm ([19]l reduces to 
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,n+l 



S^(u n - r x K T (v n + (i 
(1 - 9)v n + 6{v n + Ku 



,n-l 



)/*)) 



n+l 



y)- 



(22) 



(23) 



The last special case (|23p is the same as algorithm 5.6 of Zhang et all 120111 ] . For this reason 
we will call algorithm (I19p the generalized basis pursuit denoising algorithm (GBPDNA). 

In algorithms (|22|) and (|23|) the step size parameter t\ satisfies ri||ii'|| 2 < 1. As a result of 
the soft-thresholding, the u n are sparse in every step. 
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4 Explicit formulas for the algorithms used in Section [2] 

The iterative algorithms of Sections l3,ll and l3,2l were compared in the framework of TV -penalized 
seismic recovery (equations ([2]) and ((U)). The same algorithms, with minor modifications, can 
also be used to solve the other three problems of Section [2] as well. In this section we give some 
explicit formulas for the expressions encountered in these penalties and algorithms. The details 
depend e.g. on the number of spatial dimensions (2D, or 3D), on the precise expression of the 
differencing matrix A in functionals ([2]) and ([3]), etc. 

For the TV regularized problem (Jl|) we set A = grad, which we can define as grad(u) = 
(A x u, A y u) and : 



(&xU, 
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u i+l,j 
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1 , . f Ui .,-4-1 — Ui i j:l...N— 1 



(AyU 



ywji,j 
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4 'J 



i : 1 . . . N 

j = N 



(24) 

(i,j:l... N) for u € M 7VxiV . So for a 2D model -u € M> NxN , the gradient field grad(n) will be in 
x \ v.: rp^ transpose f ^4 jg then given by the formulas A T (i 



(Alw x ) id 



and 






x,i,3 



+ W X 



-l.j 



(^ 
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~ w y,i,j "r w y,iJ-i 
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[W X ,Wy) 
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i = 1 

i : 2 . . . JV 

i = i 



A^w x + AyWy with: 



1 



(25) 



(i,j : 1 . . . iV~) for a w x ,w y E M iVxiV . With these definitions one has that (Au, w) = (u, A T w) for 
all u £ R NxN and all w G R^^x 2 . 

These operations are easy to code (e.g. in MATLAB) and the extension of the above formulas 
to 3D models (u G M_ NxN xN ) j s straightforwa rd. In the examples of Section[2]we used the cubed 
sphere parametrization of lRonchi et al.1 19961 ]. The formulas for grad (or A^ and A y ) and grad 
(or A^ and A^) are the same as in (|24p and (f23|) . except that other boundary conditions are 
used to ensure the correct behavior at the edges of the six ('square') faces that make up the 
parametrization of the sphere. 

For minimizing the functional (JU) the generalized iterative soft-thresholding algorithm (|lip 
was used. The nonlinear operator P\ appearing in it, in this case becomes in accordance with 
formula (1101): 



P\(w x ,w y ) 
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u> 2 + u; 2 



--(W X ,Wy 



it; 2 + ui 2 > A 



(26) 



{W X ,Wy 



W 



2 i w 2 < X 

x ' ^v — ' 



for (w x ,w y ) € M 2 . Here we have dropped the double subscript i, j £ {1 . . . iV} for clarity. 

For the Huber TV regularization method, one uses the same expressions for A = grad and 
A T as in the TV case, but the operator P\ has to be replaced by: 

—. :(w x ,w y ) Jw%+w%>\ + a 

// * (27) 



P\,a(u! x ,W y 



w\ + W 2 



A 



{w x ,w y ) 



■iv: 



+ w?,<\ + 



o. 



again applied in every pixel. With this minor modification the algorithms of Section [3] may also 
be applied to problem (JSJ. The convergence is still guaranteed under the same conditions as in 
Theorem [1] (see also note after proof in Section [5]) . 
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For the problem ([8]), we have set Au = Hess(u) with 

Alu A x A v u 



Hess(u) 



A y A x u A 2 y u 



for u G 



xN 



i.e. Hess(u) G 



x7Vx2x2 



. Its transpose A T is given by the formula: 



Hess T H = {Alfwu + AlA T x wi2 + A^A^w 21 + (A^) 2 w 22 



(28) 



(29) 



acting on a w G flj" x " x2x2 ; with A^ and Ay defined as in ()25p . The Frobenius norm of the 
local Hessian, as used in penalty ©, is: 



#11 #12 
#21 #22 



ff 2 + fl-2 + H 2 + H 2 
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21 



'22 



(30) 



for If £ I 2x2 . It is equal to \J o\ + a\ where a±, a 2 are the singular values of H. In formula 
the sum over all pixels of expression (|30p is taken, i.e.: 



\\\Axh 
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(31) 



The advantage of using this matrix norm lies not only in the fact that it is easy to compute but 
also in the fact that the penalty becomes isotropic. In other words, a rotation of the coordinate 
axis, does not change this expression (the rotation (x',y') = R(x,y) leads to H' = RHR T 
and therefore ||#'||f = ||#||f)- Other spectral matrix norms (i.e. norms that are based on 
the spectrum of H), besides the Frobenius norm, also lead to isotropic penalties. However, 
the Frobenius norm is particularly easy to use as it does not require an explicit singular value 
decomposition of each H (in each pixel) to compute it. And, as we shall see, the associated 
projection P\ is also easy to compute. Note that e.g. the penalty ^-- |(Aw)ij| is also isotropic. 
It leads to models where the laplacian Au of u will be mostly zero, i.e. models that are piecewise 
harmonic (not piecewise linear). 

In the case ([8]) the expression for P\ used in algorithm (jlip is 
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(32) 

for (ion, w\ 2 ; w 2 \,w 22 ) G M 2x2 . Again we have dropped the double subscript i,j for clarity but 
it is understood that the above formula should be applied in every pixel. As already mentioned, 
this expression for P\ does not require a singular value decomposition of H in every pixel. 

In case of the total generalized variation penalty ([7]), one has a functional of the form 
^H-ftTu — y\\ 2 + A(||grad(u) — v\\\ + a\\Dv\\\). It can therefore be treated with a combination of 
the above formulas for P\, grad, A x , A y , etc. The matrix A in algorithm (jlip now takes the form 

I nnrl arts nn (11. iA whprp 11 is an n.nviliarv variaTilp in H$-"'XiVX2 

aD 



and acts on (u, v) where v is an auxiliary variable in 



The operator 



D acting on v is a derivative: Dv 
involves therefore: 



A x v y 



A y v x 

AyVy 



The explicit expression for penalty ([7]) 



pixels 



^{A x u - v x ) 2 + (A y u - Vyf + a^{A x v x ) 2 + {A y v x ) 2 + (A x v y ) 2 + (A y v y )< 



(33) 
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Name 


1 output 


TV 


0.20643 


Huber-TV 


0.22519 


TGV 


0.20843 


HP 


0.20033 


Haar 


0.39789 


CDF 4-2 


0.25248 



^inputll /ll^input II ||^ u output 



\n\ 



1.0009 
1.0041 
1.0003 
1.0080 
1.0034 
1.0059 

Table 1: Reconstruction results for the six reconstructions of Section [2 All reconstructions fit 
the data equally well (third column), and have slightly different reconstruction error (second 
column). 



This penalty is denoted by -isymTGV in [Bredies et al.l . 120101 ] . The explicit expression for the 



nonlinear projection operator P\ used in the iterative algorithms now combines both expressions 
(|27p (for the first line of A) and (|32p (for the second line of A). In this case the norm \\A\\ depends 
on the parameter a. 

For the simulation of Section [21 the algorithm (|lip was implemented four times with the 
above definitions of A and P\. For those cases, we chose t\ = 1.99/||-PC|| 2 and t^ = 0.99/||A|| 2 
and 1000 iterations were performed. 1000 iterations correspond to about 20 seconds of computer 
time on a single 2.66GHz CPU with 12GB of memory, running Matlab R2009a. The relative 
error with the input model, ||x output — x mput ||/||x input ||, is given in Table [1] for each of the six 
reconstructions. Two sparse wavel et reconstruction were als o calculated in Section [2 Here 1000 



iterations of the FISTA algorithm Beck and Teboulld . 120091 ] were used with appropriate wavelet 



transform W (see formula ([9]) and algorithm (|15p with K — > KW^ 1 ). 

None of the six output models *u out P ut is expected to be exactly equal to the input model 
u mpnt because of the lack of data, the noise on the data and the effect of the penalties on the 
reconstructions . 

The computational complexity of the algorithms depends on the matrices K and A (or W) . 
It was already mentioned in Section that the iterative algorithms discussed in this paper use 
a single application of K, K , A and A in every step (except for algorithm (|14p ). The matrix 
K encodes the relationship between the model u and data y. In case K is a dense matrix, a 
single application of K (or its transpose) requires 0(mN) operations. Here N is the number 
of components of u and m is the number of data. If the matrix K is sparse (as is the case 
for the examples in this paper), or has structure (as e.g. in deconvolution) then this can be 
reduced significantly. If the matrix A is chosen as a local differencing operator, then applying 
A (or its transpose) requires O(N) operations. The same is true for a wavelet transform W. 
The convex projections P\ are also O(N) operations (as they are applied componentwise). The 
operator T y ^ used in the constrained algorithms ()18p and ()19p has computational complexity 
0(m). The computing times mentioned in this paper do not serve as a basis for extrapolation to 
other situations; what is important here is that the application of such A and P\ typically takes 
less time than applying K and K . This is what makes the application of the edge-preserving 
penalties of Section [2] possible in practice. 



5 Proof of convergence of constrained algorithm (1191) 



Without loss of generality we set t\ = t<i = 1 in algorithm ()19p and prove convergence for 
||iT|| < 1 and \\A\\ < 1 (the step sizes can be introduced by scaling the matrices K, A and the 
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data y). The algorithm (|19p can thus be written as: 



T„„n 



A 1 W 



W 



n+l 



„n+l 



P M (w n + ylu n+1 ) 

u n _ K T (v n + Ku n - z n ) - A T w n+l 



n+l 



(34) 



Qy i£ (Ku 
v n + 9(Ku 



+ v r 



n+l 



y n+l> 



with ||iT|| < 1 and ||^4|| < 1. Indeed, the variable z n+1 can be eliminated from the last line of 
Plj) to yield the last line of (fT9l) : 



,,n+l 



■; n + (Ku 



n+l 



y n+l 



) 



(1 - 9)v n + (& n+1 + v n - Q y , e {Ku n+l + v n )) 
(1 - 9)v n + 6>r y , e (Ku n+1 + v n ) ' 



n 



1) in 



where we used T y>e (a) = a — Q y ^(a). Substitution of the last line of (fM)) (with n 
the first line of algorithm (|34p then yields the first line of algorithm (|19p . 

The minimizer of problem ([3j) is determined by its variational equations. These are derived 
as follows. Introducing a positive parameter fj,, we write problem ([3]) as min u /u||.Au||i + L(Ku), 
where I is the indicator function of the £2 ball of radius e around y: I{Ku) = for \\Ku — y\\ < e 
and I(Ku) = 00 for \\Ku — y\\ > e. The variational equations of the constrained problem @ 
are therefore 

0, (35) 

|i at Au and v an element of the subdifferential 



A T w + K T v 



with w an element of the subdifferential of u\ 
of / at Ku. 

The subdifferential w of /z||.Ax||i satisfies u>j = fj, (Au)i/\(Au)i\ for (>l«)j 7^ and |u>j| < \x 
for (Au)j = 0. This means that (Au)i = S^(wi + (Au)i) or equivalently, using (fTUj) . that 
u)j = P^Wi + (Au)j), which we write as w = P^(w + Au). 

Similarly one shows that the subdifferential of v of I at Ku is characterized by the relation 
Ku = Q Vtt (Ku + v). The variational equations that determine the minimizer of problem ([3]) can 
therefore be written with an auxiliary variable z = Ku as: 



w = P^ (w + Au) 

u = u — K (v + Ku 

Z = Qy,e (KU + V) 

v = v + (Ku — z) , 



A T w 



(36) 



which correspond to the fixed point equations of iteration (|34p when 0^0. 

Lemma 1. Let Pc be a projection on a non-empty closed convex set C C R . 
1 . Ifu+ = P c (u- + A), then 



Let u^ 



u 



-||2 



A) 



u \\~ < \\ u — ""IT - || u ^ — u I '' - 2(n — u 
for all u € C. 
Proof. Because Pc is the projection on the convex set C we have: 

{u-P(u'),u -P c (u')) <0 
for all u G C and all u'. Setting u' = u~ + A and Pc(u') = u + in this inequality yields 

A - u + ) < 0. 



,Ae 



(37) 



u ,u 



As (u — u + ,u —u + ) 



u — u 



+ \\u 



u 



\u 



u 



-||2 



) /2 we find the relation d37J) . □ 
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The proof of the follow ing theorem i s a com bination of the proof of theorem 2 in Loris and Verhoevenl . 



20111 ] and theorem 4.2 of [Zhang et all 12011 ]. 



Theorem 1. In a finite dimensional setting, and when \\K\\ < 1, \\A\\ < 1 and < 9 < 1, the 

iteration [3$ converges to a fixed point and provides a minimizer of problem ([3J] . 



Proof. It was already remarked that the fixed point equations (|36p correspond to the variational 
equations for problem ([3]). The problem ([3]) has a solution and therefore a solution (u,w,v,z) 
to the equations ([36]) exists. 

Taking into account line 2 of algorithm (|34p and lemma [T] with u + = w n+1 , u~ = w n , 
A = Au n+1 , u = w, and Pc = Pu, we find that: 



\w 



n+1 - wf < \\w n - wf - \\w n+1 - w n f - 2(w - w n+ \A{u n+l - A T (w n - w n+1 ))) 



where we have also used u n+1 = u n+1 — A T (w n — w n+l ). Similarly one finds from the first line 
of (|36p and application of lemma [TJ (with u + = w, u~ = w, A = Au, u = w n+1 and Pc = Pfi) 
that: 

\\w-w n+1 \\ 2 < \\w - w n+1 \\ 2 - \\w - wf - 2(w n+1 - w, Au). 

Together these two inequalities yield: 

II n-\-'\ ^Il9 ^ w n ~I|9 II rj-4-1 rjll9 

\\w + —w\\ < \\w — w\\ — \\W + — W \\ 

-2{w - w n+1 ,A ((u n+1 - u) - A T (w n - w n+1 ))) 
= \\w n - wf - \\w n+l - w n f + \\A T {w n+1 - w)f (38) 



+ \\A T (w n+1 - w n )f - \\A T (w n 
-2(w - w n+1 , A(u n+1 -«)). 



w 



|2 



In the same way, the third line of algorithm ()34p and lemma [T] (with u + = u n+1 , u 
A = -K T (v n + Ku n - z n ) - A T w n+1 , u = u, P c = Id) implies 



u 



- uf < \\u n - uf - \\u n+1 - u n f - 2(u - u n+1 , -K T (v n + Ku n - z n ) - A T w n+l ) 

and the second line of ()36p with lemma \T\ (with u + = u, u~ = n, A = —K(v + Ku — z) — Aw, 
u = u n+1 , Pc = Id) implies: 

||u - n n+1 || 2 < ||u - n n+1 || 2 -\\u- uf - 2(u n+1 - u, -K T (v + Ku - z) - A T w), 

such that together they yield: 

||u n+1 --u|| 2 < \\u n - uf - \\u n+1 - u n f 

-2(u - u n+1 , -K T (v n - v + K(u n -u)- (z n - z)) - A T {w n+1 - w)) 
< \\u n - uf - \\u n+1 - u n f 

+2{K(u - u n+1 ),K(u n - u)) - 2{K{u - u n+1 ),z n - z) 
+2{u - u n+1 ,K T (v n -v) + A T (w n+1 - w)) 
= \\u n - uf - \\u n+1 - u n \\ 2 



-\\K(u n+1 - u)f + \\K(u n+1 - u n )f - \\K(u n - u)f 
+ \\K(u n+1 - u)f - \\z - z n - K(u - u n+1 )f + \\z n - zf 
-2(u - u n+1 , -K T (v n -v)- A T (w n+1 - w)) 
\u n - uf - \\u n+1 - u n f + \\K{u n+1 - u n )f 

12 lis -,n js(„", „,n+l\||2 , \\ „n s||2 



-\\K(u n - u)f -\\z- z n - K(u - u n+1 )f + \\z n - Z\\ 
-2{u - u n+1 , -K T (v n -v)- A T (w n+1 - w)). 

(39) 

And from the fourth line of algorithm (|34|) and lemma [1] (with u + = z n+1 , u = 0, A = 

Ku n+l + v n , u = z and Pc = Q y ,e) one finds: 

|| z n+l _ ~||2 < || Q _ £ ||2 _ || z n+l _ Q ||2 _ ^ _ ^n+1^ K/ +l + y ny 
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Similarly, from the third line of (|36|) and lemma [T] with u + = z, u = 0, A = Ku + v, u = z n+l 
and Pc = Q y ,e one has: 



y n+l 



z\r < \\o-z 



n+li|2 



z-0\\ z -2{z n+i -z,Ku + v) 



which together yield: 



2\\z 



n+1 



z\\ z < 2(z n+L - z, K(u n+l -u) + v n -v) 



n+1 



or: 



< 2(z n+1 -z,-(z n+1 -z) + K{u n+1 -u) + v n -v) 

.n+1 „1M|2 



\ z n+l _ £||2 



\z — z 
+2{z n+l -z,v n -v). 



n+1 + K(u n+1 - u)\\ 2 + \\K(u n+1 - u) 



\z n+1 -z\\ 2 



7— 2i|„,n+l _ „,n||2 

+2(z n+l -z,v n -v) 



r" ' - r" IT + \\K(u n ' •' - ■'/,! 



,n+l 



(40) 



where we have used the last line of (|34"|) and of (|36|) . 

Finally from the last line of (|34p and from lemma Q] (with u + = v n+1 , u~ = v r \ A 



8(Ku n+l - z n+i ), u = v and P c = Id) it follows that: 



\v n+1 -v\\ 2 < \\v r 



vf - IK +1 - v n \\ 2 - 29{v - v n+ \Ku n+l - z n+l ). 



In the same way it follows from the last line of the fixed-point equation (|36|) and lemma [T] (with 
u + = v, u~ = v, A = 6(Ku — z), u = v n+1 and Pc = Id) that: 



,n+l _ 6 ||2 < ll^n+l _ .„2 _ m - _ -,,2 _ g^n+l 



v,Ku — z) 



and together the last two expressions yield: 



-l|L,n+l_£||2 



last lines of (134b and J36I ) 



-llL.n 



i i; n -t;|| 2 -0- 1 ||t; n+1 -t; n || 2 

+2(v n+1 - v, K(u n+1 -u)- (z n+1 - z)) 



? _1 |lu" — ON 2 



-lll^n+l _ ^"112 



+2{v n+1 - v n , K(u n+1 - u) - (z n+1 - z)) 
+2(v n - v, K(u n+1 - u) - (z n+1 - z)) 



)-M\ v n _ g||2 + Q-l\\ v n+l _ „n||2 



+2(v n - v, K(u n+1 -u)- (z n+1 - z)). 



Adding inequalities 



(gDD and dHJ together yields: 



(41) 



\u n+1 -u\\ 2 



\ w n+l _ ^112 + Q-l\\ v n+l _ fl||2 < || u « _ fl ||2 



| u n+l _ u n|j2 



K{u n+1 - u)\\ 2 + \\K(u n+1 - u n )\\ 2 - \\K(u n - u)\\ 



n+l\||2 



+ \\z r - 



\z — z n — K(u — u 

n '•Hi 

\W — W\\ 

\A T (w n+1 - w n )f - \\A r (w n - w) 



w n+l _ w n||2 + \\A T (w n+1 - w) 

n\ ||2 II aT ( a ,,n 



z n+1 -z\\ 2 



-2 ll^ri+1 _ v n\\2 _i_ ^-l|| v ' 1 _ {)||2 _i_ Q — l II v n+l _ v n\\2 



(42) 



as all remaining inner products cancel. As \\K \\ < 1 and \\A\\ < 1 we can introduce real regular 
square matrices L and B such that L T L = 1 — K T K and B T B = 1 — AA T . The last inequality 
becomes: 



\L(u n+1 - u)\\ 2 + \\B(w n+1 



rirtlia + e-Hv^ 1 - V\\ 2 + \\z n+1 - Z\\ 2 



< \\L(u n - u)\\ 2 + \\B(w n - w)\\ 2 + O^Wv 11 - v\\ 2 + \\z n - i|| 2 



\L(u 



n+1 -u n )\\ 2 



\B(w n+1 -w n )\\ 2 



\z — z n — K(u — u 



n+l\\\2 



(43) 
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where we have used that -0- 2 \\v n+l - v n \\ 2 + 6- l \\v n+1 - v n \\ 2 < (for < 9 < 1). 

It follows from inequality (ji3]) that (u n , w n ,v n ,z n ) n is a bounded sequence and that a conver- 
gent subsequence (u nj ', u> nj ', v n i ', z n J ') j exists. We call the limit of this subsequence (-u^u^^t^t^ 

Summing inequality ()43|) from M to N — 1 yields: 

||L(u* - u)\\ 2 + \\B{w N -w)\\ 2 + O- 1 ^ - v\\ 2 + \\z N - z\\ 2 

< \\L{u M - u)\\ 2 + \\B{w M - w)\\ 2 + Q- 1 ]^ 1 - -0|| 2 + \\z M - z\\ 2 (44) 

" En=M (ll L K +1 " un )f + \\B{w n+1 - w n )\\ 2 + \\z - z n - K{u - u n+l )\\ 2 ) . 

It follows that the sum in the right hand side of this expression bounded (independent of N) and 
therefore that \\L(u n+1 —u n )\\, \\B{w n+1 — w n )\\ and \\z — z n — K{u — u n+1 )\\ tend to zero when 
n tends to infinity. As B and L are regular, this implies that ||u n+1 — u n \\ and ||tt> n+1 — u> n || 
tend to zero for large n. Then also \\z — z n — K(u — u n )\\ tends to zero for large n. The iteration 
(|34"|) and relations (|36|) imply that v n+1 = v n + 0(K(u n+1 — u) — (z n+1 — z)), and we therefore 
find that ||f n+1 — v n \\ — > 0. It follows also from the iteration (JMJ) and the previous remarks 

that: 

\\z n+1 -z n \\ = \\Q y>e {Ku n+1 + v n ) - Qy, e {Ku n + v™" 1 )!! 

< \\{Ku n+l + v n ) - {Ku n + v n ~ l )\\ 

< \\K(u n+1 -u n )\\ + ik-t/ 1 - 1 !! "-^0. 

We can therefore conclude that (u nj+1 ,w nj+1 ,v n -> +1 ,z n -> +1 ) — > (vr , w',v', z*) as well. This 
implies then that (u' , w', v>, z*) is a fixed point of the iteration (|34p and satisfies the equations 

HMD. 

We now choose (u,w,v,z) = (u* , w\v',z*) in inequality flUJ) and find that: 

\\ L ( U N _ u t)||2 + \\ B (w N - u;t)||2 + e- X \\v N - V^\\ 2 + ||** - zt|| 2 

< ||L(n A/ - ut)|| 2 + \\B{W M - wt)||2 + 6 -l\\ v M _ v t||2 + || Z A/ _ z f||2 

for A^ > M. As there is a convergent subsequence, the right hand side can be made as small 
as desired by choice of M, and it follows that the entire sequence (u n , w n , v n , z 11 ) tends to the 
fixed point (u^,w^,v^, z') for large n. D 

One can adapt the algorithm (|19p (or (|34p) and the proof to suit the problem min x H{Ax) + 
J{Kx) (with H and J two convex functions). In this case the projection operator P^ must be 
replaced by the proximity operator prox^* and the projection operator Q y ^ must be replaced by 
the proximity operator of J: proxj. As long as these (non-linear ) operators have simple expres- 



sions, the above algorithms are useable in those c ases as well. See ICombettes and Pesquetl . 120111 ] 



for a general review of proximity operators and JLoris and Verhoevenl . 120111 ] on how proximity 



operators were used for a generalization of problem (J2j) and algorithm (fTT]) . 

6 Conclusions 

A set of simple iterative algorithms for the minimization of a penalized least squares functional 
([2]) was presented and their convergence speed was compared numerically in a seismic tomog- 
raphy context {K ^ Id). For the examples considered, the comparison shows that all of these 
algorithm can produce the minimizer of the functional to within 1 to 10% after about 1000 
iterations. The 'best' algorithm depends on the step sizes, the penalty parameter A and on the 
metric chosen (functional or distance to minimizer). They can therefore all be used in practice 
for solving these kind of problems as they appear in seismic tomography. 

The four algorithms for minimizing the penalized functional ([2]) are not new. Here they 
are presented in a uniform notation making comparison and implementation easier. We also 
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presented and compared two iterative algorithms for the (equivalent) constrained formulation 
(|3|) of this problem. One of these algorithms is new and a proof of convergence is included. 

The advantage of the constrained formulation ([3]) over the penalized formulation ([2]) is that 
often the penalty parameter A has to be determined by trial and error so as to fit the data to 
the desired level ||iifu(A) — y\\ < e. This is done automatically in formulation (|3j). We have 
listed simple iterative algorithms for both approaches. However, if the operator A is invertible 
(and the inverse is readily available), then it can be more advantageous to make a change of 
variables in the penalized formulation and to use an accelerated algorithm (see discussion at end 
of Section IXTj) . 

For the readers' convenience we wrote down the explicit forms of several non-smooth convex 
penalties that can be used for edge-preserving regularization of a seismic imaging problem (total 
variation penalties and various generalizations) and that can be solved with these algorithms. 
We also wrote explicit formulas for the convex projection operators that are used by the iterative 
algorithms in these cases. 

We discussed the qualitative properties of these reconstructions on a synthetic seismic to- 
mography example. Particular emphasis was given to the role of sparse loc al differences. For 



the c ase o f image denoising K = Id with TV-like penalties one may refer to Strong and Chan 



2003] and Setzer et all 120111 ] . Still in the case of denoising, other penalties on local differences 



(other than ^i-norm type, including non-conv ex ones), with the goal of maintaining edges in the 



reconstruction, are compared numerically in Lukic et al.l . |2011 |. 
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